Geospatial data is data that represents a particular geographic location.
As a result of the proliferation of remote sensing data, there is increasingly a wide availability of geospatial data.
In this workshop, we will learn how to use geospatial data in R.
It is possible to combine geospatial data with other data for conducting a data analysis.
Geospatial data comes in two basic types:
vector data represents points and shapes. For example, vector data could be used to locate cities and roads on a map, or to draw out the boundaries of an electoral district.
raster data is a grid that represents a variable over the full extent of a map. For example, it would be common to show the temperature or the elevation (altitude) across the landscape using raster data.
For geospatial data analysis, we will use the sf and terra packages. terra is especially for raster data.
These both integrate well with the tidyverse. You’ll need to install these with install.packages() the first time you use them. We will also use the tidyverse package.
Here is an example of vector data. This data comes from the City of Ottawa and shows locations of traffic collisions involving pedestrians or bicycles from 2017 to 2024. It also show the location of major roads in Ottawa, which is downloaded from Open Street Map. Notice how the vector data corresponds to points or shapes.
Read the roads data using st_read(). This is a special function for reading shapefiles (.shp). The data contains lines.
Read the traffic collisions data using st_read(). The data contains points where collisions took place. Filter to focus only on pedestrians and cyclists.
vec <- st_read("data/Traffic_Collision_Data/Traffic_Collision_Data.shp", quiet=TRUE) %>%
# filter out the erronous record
filter(Lat > 40) %>%
# Set the CRS to be the same in both data sets
st_transform(st_crs(ott_roads)) %>%
filter(Num_of_Bic > 0 | Num_Of_Ped > 0) %>%
mutate(type = case_when(Num_of_Bic > 0 ~ "bicycle",
Num_Of_Ped > 0 ~ "pedestrian"))Plot the data using geom_sf(). Notice how I can add layers on top of each other by specifying the data for each layer separately.
Here is an example of raster data. This data set comes from Natural Resources Canada. Different colours on the map correspond to different types of forestry management regime, including protected areas, Treaty/Settlement lands, private lands, crown forest lands, etc. Notice how every spot on the map is covered by the data.
Read the raster file (.tif) using the rast() function in the terra package.
Use the plot() function to make the map. Note that this is a different function than ggplot().
If we want to plot the data using ggplot() we need to convert the raster into a data frame. This is very slow, so it’s better to stick with the plot() function above.
Spatial data are (usually) data about places on the Earth.
The Earth is a three-dimensional ellipse, but it is convenient to display and conduct analysis in two dimensions.
This means that we need a way of projecting three dimensional data onto two dimensionals.
We can do this in lots of different ways. For example, two two maps below show a map of Canada using two different coordinate systems, or projections.
For conducting geospatial analysis, thinking about the CRS is important, for a couple of reasons.
First, all CRSs distort, since they are projecting a three dimensional shape on two dimensions. It’s important to choose an appropriate CRS for making plots. Usually, the two CRSs above are chosen for north american maps.
Second, when combining information from more than one geospatial source, it is critical to ensure that they are both projected using the same CRS.
You can find out the CRS of a data set by using st_crs(), set CRS using st_set_crs(), and you can transform a data set from one CRS to another by using st_transform().
Coordinate system 4326 is longitude/latitude.
Vector data is usually provided as a shapefile.
A shapefile usually is a compressed folder containing a number of files with identical names but different extensions.
They are all important, but you will read the one with the .shp extension.
You can use the function st_read() in the sf package to read vector data like a shapefile.
For our example, we will load a shapefile containing the provincial boundaries.
You can obtain this shapefile on the Statistics Canada website (download the digital, rather than cartographic file, because it is much smaller).
Once you download it onto your computer, you’ll need to unzip it (by right-clicking on it). You can read it using st_read()
Simple feature collection with 2 features and 8 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 7609465 ymin: 1587018 xmax: 9019157 ymax: 2991652
Projected CRS: NAD83 / Statistics Canada Lambert
PRUID DGUID PRNAME
1 10 2021A000210 Newfoundland and Labrador / Terre-Neuve-et-Labrador
2 11 2021A000211 Prince Edward Island / Île-du-Prince-Édouard
PRENAME PRFNAME PREABBR PRFABBR LANDAREA
1 Newfoundland and Labrador Terre-Neuve-et-Labrador N.L. T.-N.-L. 358170.373
2 Prince Edward Island Île-du-Prince-Édouard P.E.I. Î.-P.-É. 5681.179
geometry
1 MULTIPOLYGON (((7644465 298...
2 MULTIPOLYGON (((8427185 163...
Notice that the file looks similar to a standard tibble() object – it contains rows corresponding to observations and columns corresponding to variables.
In this case, each row is a province, and columns include things like the province name and the land area.
For geospatial objects like this one, there is a special column called geometry.
In this case, where the data is provincial boundaries, the geography column tells R exactly where the boundaries for each province lie.
Because the vector data looks like a tibble(), we can perform normal tidyverse operations (filter, mutate, …).
We can plot using ggplot() . geom_sf() is the layer type for this type of data.
We can also join the data set with other data.
The following code loads some data on the crime severity index in 2023, and joins it with the provincial boundary file.
There are lots of other things you can do with vector data.
To illustrate some of these we will use data from the City of Ottawa’s Parks and Greenspace shapefile as well as the Crime Statistics from the City of Ottawa’s Police Department.
The Parks file contains the shapes/areas of parks in the city, and the crime file contains points where crimes have taken place.
First, we load the parks data.
We will filter the data so that it only includes parks in the Somerset Ward, and show a picture of these parks, shaded according to their type.
It is possible to find the centroid of each of these parks using st_centroid().
We can also draw a buffer around each of the parks using st_buffer. We have to tell R the size of the buffer (e.g., 100m).
We can calculate the area of objects using st_area(). Here we calculate the area of all the parks and pick the biggest one.
somerset_parks <- somerset_parks %>%
mutate(park_area = st_area(.))
largest_park <- somerset_parks %>%
filter(park_area == max(somerset_parks$park_area)) %>%
select(NAME, park_area)
largest_parkSimple feature collection with 1 feature and 2 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: -8427341 ymin: 5685999 xmax: -8427031 ymax: 5686260
Projected CRS: WGS 84 / Pseudo-Mercator
NAME park_area geometry
1 McNabb Park 43024.09 [m^2] MULTIPOLYGON (((-8427308 56...
We can conduct distance calculations using st_distance(). This tells us the distance between two objects. Here we calculate the distance between the centroid of the smallest and largest park, for illustration.
Buffers, centroids, distances, areas.
Next, we overlay data on crimes.
We read data using st_read().
We focus again on crimes in Somerset ward, and restrict our attention to Assaults.
When we work with multiple data sets, we need to ensure they are projected onto the same CRS. Can use st_transform() to project onto a different CRS. We also use st_crop() to filter out any crimes that aren’t covered in our parks map.
Use a separate geom_sf() with a different data set to plot multiple geographic layers.
We can use st_intersection() to determine if one geographic layer intersects with another. For example st_intersection(crimes, park_buffers) will find which crimes occur within our park buffers.
# This data set only includes crimes near parks
near_crimes <- st_intersection(crimes, park_buffers) %>%
dplyr::select(crime_id) %>%
st_drop_geometry() %>%
mutate(near_park = TRUE)
# Now we have a data set where each crime is categorized as to whether it is near a park or not
crimes_parks <- full_join(crimes, near_crimes)
head(crimes_parks)Simple feature collection with 6 features and 18 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: -8427559 ymin: 5685803 xmax: -8425321 ymax: 5687693
Projected CRS: WGS 84 / Pseudo-Mercator
YEAR REP_DATE REP_HOUR OCC_DATE OCC_HOUR WEEKDAY
1 2018 2018-01-18 1800 2018-01-18 1600 Thursday
2 2018 2018-01-20 300 2018-01-20 300 Saturday
3 2018 2018-01-20 300 2018-01-20 300 Saturday
4 2018 2018-01-21 200 2018-01-21 2000 Sunday
5 2018 2018-01-22 1700 2018-01-20 2200 Saturday
6 2018 2018-01-22 1700 2018-01-20 2200 Saturday
OFF_SUM OFF_CATEG NB_NAME_EN SECTOR DIVISION
1 Crimes Against The Person (1000) Assaults West Centertown Sector 22 Central
2 Crimes Against The Person (1000) Assaults Centretown Sector 23 Central
3 Crimes Against The Person (1000) Assaults Centretown Sector 23 Central
4 Crimes Against The Person (1000) Assaults Centretown Sector 23 Central
5 Crimes Against The Person (1000) Assaults Centretown Sector 23 Central
6 Crimes Against The Person (1000) Assaults Centretown Sector 23 Central
CENSUS_TRC TOD WARD COUNCILLOR
1 5050042.00 Afternoon Ward 14 - Somerset Ariel Troster
2 5050037.00 Night Ward 14 - Somerset Ariel Troster
3 5050037.00 Night Ward 14 - Somerset Ariel Troster
4 5050049.00 Evening Ward 14 - Somerset Ariel Troster
5 5050037.00 Evening Ward 14 - Somerset Ariel Troster
6 5050037.00 Evening Ward 14 - Somerset Ariel Troster
INTERSECTI crime_id near_park geometry
1 BELL ST, GLADSTONE AVE 1 NA POINT (-8427559 5685803)
2 ELGIN ST, GILMOUR ST 2 TRUE POINT (-8425688 5687338)
3 ELGIN ST, GILMOUR ST 2 TRUE POINT (-8425688 5687338)
4 MACLAREN ST, MACDONALD ST 3 TRUE POINT (-8425321 5687693)
5 JACK PURCELL LANE, LEWIS ST 4 TRUE POINT (-8425707 5687253)
6 JACK PURCELL LANE, LEWIS ST 4 TRUE POINT (-8425707 5687253)
Let’s see if the assaults that occur near parks (within the 100m) buffer are of different characteristics that the assaults that are further from parks. In particular, we test whether assaults that have occurred near parks are more likely to have occurred at night. We conduct this analysis using a t-test, and also make a summary figure. This is an example of how you can combine geospatial data with statistical analysis.
We can now use our extracted data in a statistical analysis to answer our research question.
# t-test to see if crimes are more likely at night near parks
m1 <- lm(night ~ near_park, data=night_crimes)
modelsummary::modelsummary(m1, gof_map = c("nobs", "r.squared"), stars = TRUE)| (1) | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | 0.330*** |
| (0.010) | |
| near_parkTRUE | 0.129*** |
| (0.026) | |
| Num.Obs. | 2595 |
| R2 | 0.009 |
Raster data is a bit different to vector data.
Rather than representing points, lines, or shapes, raster data is represented by a grid, and captures conditions continuously across the extent of the geospatial data.
For example, raster data may represent elevation, or land use, or temperature.
[1] "PM25"
Raster data is read using the terra package (need to install first). Read a raster using the rast() function. Plot a raster using plot().
plot() produces a nice raster plot. But it doesn’t use the ggplot() syntax we are used to, which means we can’t easily add other layers. To use ggplot, we need to convert the raster to a data frame.
The most common thing to do with raster data is to extract() it.
This operation reveals the level of the raster variable corresponding to some vector.
For example, it would be possible to get the maximum pollution level in all cities in the world (by extracting the pollution raster over city points) or the average pollution level in different countries or regions.
To extract() data from a raster over a vector, need an sf object.
Use extract to obtain pollution in each city.
We can extract data over different types of vectors - points, lines, polygons
We will do the latter and calculate average pollution in African countries. We begin by loading up shapefiles for African countries. This is a vector file as above.
We then use the extract function to pull out the level of pollution in different countries. There are many raster grid cells in each country, and we are taking the mean of these (see the reference to mean in the function call).
country_air_pollution <- cp %>%
as_tibble() %>%
dplyr::select(ADM0_NAME, ISO3, pm2021)
head(country_air_pollution)# A tibble: 6 × 3
ADM0_NAME ISO3 pm2021
<chr> <chr> <dbl>
1 Sudan SDN 32.0
2 Angola AGO 18.0
3 Benin BEN 40.3
4 Botswana BWA 11.3
5 Burkina Faso BFA 35.6
6 Cameroon CMR 33.9
We can then plot the results.
As before, we can combine this with other data for analysis.
Here, I will combine the air pollution concentration data with data on GDP per capita, and see if there’s a relationship between the two.
Data can be downloaded directly from World Bank
# Obtain GDP per capita data from the world bank.
library(wbstats)
# wb_search("GDP per capita") to find data series
gdp <- wb_data("NY.GDP.PCAP.CD") %>%
# Same year as air pollution data
filter(date == 2021) %>%
dplyr::select(ISO3 = iso3c, gdp_per_capita = NY.GDP.PCAP.CD)
dat <- inner_join(country_air_pollution, gdp)Is higher GDP per capita associated with less pollution? Is it causal?
| (1) | |
|---|---|
| + p < 0.1, * p < 0.05, ** p < 0.01, *** p < 0.001 | |
| (Intercept) | 5019.672*** |
| (825.107) | |
| pm2021 | -94.010** |
| (29.279) | |
| Num.Obs. | 52 |
| R2 | 0.171 |